On-line learning of probability distributions is analyzed from the field theoretical point of view. We can obtain an optimal on-line learning algorithm, since the renormalization group enables us to control the number of degrees of freedom of a system according to the number of examples. We do not learn the parameters of a model, but the probability distributions themselves. Therefore, the algorithm requires no a priori knowledge of a model.

The methods of statistical mechanics have provided a framework for analyzing learning and generalization by neural networks, and have achieved considerable success. This suggests a fundamental connection between learning and statistical mechanical theories.

For example, Bialek et al. clarified the batch learning problem of probability distributions from the field theoretical point of view. They optimally controlled the number of degrees of freedom of a learning system through the scaling, which was naturally introduced by quantum field theory. It is crucial to control the number of degrees of freedom in learning problems. Learning many features requires us to introduce a large number of degrees of freedom, while too many degrees of freedom increase ambiguities when we are given only a small number of examples. This problem has been considered in various ways.

Can we expect that the scaling also leads to an optimal algorithm in on-line learning problems? On-line learning has the advantages of requiring less data storage and less peak computation than batch learning. Some elaborate algorithms achieve the same optimal learning rate as the corresponding batch ones. In this letter, we will show that the scaling analysis yields an optimal on-line learning algorithm when we identify the number of examples with a scale parameter.

We observe a sequence of $D$-dimensional sample points $x_1, x_2, \ldots, x_N$, which are drawn independently from an unknown probability distribution $Q^*(x)$. How can we infer the distribution, which is specified by an infinite number of parameters in general? A finite number of sample points does not enable us to determine $Q^*(x)$ uniquely, but only allows us to make a probabilistic description. We start from the probability of a distribution $Q(x)$, which is given via Bayes' rule:



$$P[Q \mid x_1, \ldots, x_N] = \frac{Q(x_1) \cdots Q(x_N)\, P[Q]}{\int [dQ]\, Q(x_1) \cdots Q(x_N)\, P[Q]}. \qquad (1)$$

Here, $P[Q]$ denotes a 'prior distribution', which is a probability density in the space of probability distributions and restricts the function $Q$ to admissible classes.

Let us consider the properties of the prior distribution. Introducing a field variable $\phi(x)$, which takes values $-\infty < \phi < \infty$, we rewrite the distribution $Q(x)$ as $Q(x) = e^{-\phi(x)}/l^D$. We note that $l$ is not a parameter but a unit length scale, which is different from the case of Bialek et al. It only serves to ensure the consistency of dimensions, and is fixed to one in numerical studies. To make our learning problem well-posed, it is known that the prior distribution must not give rise to ultraviolet divergences. Therefore, we choose a prior distribution that requires the function $\phi(x)$ to be smooth enough ($2\alpha > D$):



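$$P[Q] = \frac{1}{Z_0} \exp\left[ -\int d^D x \left( \frac{l^{2\alpha - D}}{2} (\partial^\alpha \phi)^2 + \frac{1}{l^D} F(\phi) \right) \right] \delta\!\left( \frac{1}{l^D} \int d^D x\, e^{-\phi} - 1 \right). \qquad (2)$$
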
Hereafter, we will use the notation $\partial^\alpha = \partial_{x_{i_1}} \cdots \partial_{x_{i_\alpha}}$, and repeated indices $i_1, \ldots, i_\alpha$ are summed from 1 to the dimension $D$. We note that we have introduced an unknown function $F(\phi(x))$, which will be determined later so that we can treat fluctuation effects in a renormalizable way. The delta function imposes the normalization of the probability distribution, and the factor $Z_0$ is a normalization constant.

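Putting the prior (2) into Bayes' rule (1), we obtain the partition function as $Z = \int d\lambda \int \mathcal{D}\phi\, \exp[-S(\phi, \lambda)/g]$, where

$$S(\phi, \lambda) = \int d^D x \left[ \frac{l^{2\alpha - D}}{2} (\partial^\alpha \phi)^2 + \frac{1}{l^D} F(\phi) \right] + i\lambda \left( \frac{1}{l^D} \int d^D x\, e^{-\phi} - 1 \right) + N \int d^D x\, P_N \phi. \qquad (3)$$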



The constant $g$ counts fluctuation effects, and is set to one in the numerical analysis. $P_N(x)$ is the distribution of sample data, defined as $P_N(x) = \frac{1}{N} \sum_{i=1}^{N} \delta^D(x - x_i)$. The action (3) explicitly shows that we have no scale parameter other than the number of example data $N$. Therefore, we can conclude that the number $N$ sets the length scale at which we observe an unknown distribution, and determines the behavior of the fluctuating system.

In order to proceed further, we will consider the asymptotic situation $N \to \infty$. We expand the action (3) around the classical solutions, and perform the functional integration over $\phi(x)$. The classical solutions $\phi(x)$ and $\lambda$ are defined by the following two equations.

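$$(-1)^\alpha l^{2\alpha} \Delta^\alpha \phi + F'(\phi) - i\lambda e^{-\phi} = -N l^D P_N, \qquad (4)$$

$$\frac{1}{l^D} \int d^D x\, e^{-\phi} = 1. \qquad (5)$$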



If we apply the normalization condition (5) to the integral of equation (4), we find that $i\lambda = N + \int d^D x\, F'(\phi(x))/l^D$, which determines the mass term in the action (3).
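
The step can be written out explicitly. This is a sketch that assumes the derivative term integrates to zero (no boundary contribution), which the text does not state; it also uses $\int d^D x\, P_N = 1$, which follows from the definition of $P_N$.

```latex
\begin{align*}
\frac{1}{l^D} \int d^D x \left[ (-1)^\alpha l^{2\alpha} \Delta^\alpha \phi
   + F'(\phi) - i\lambda\, e^{-\phi} \right]
  &= -N \int d^D x\, P_N(x), \\
0 + \frac{1}{l^D} \int d^D x\, F'(\phi) - i\lambda \cdot 1 &= -N, \\
i\lambda &= N + \frac{1}{l^D} \int d^D x\, F'(\phi(x)).
\end{align*}
```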

It is straightforward to evaluate the functional integral over $\phi(x)$. We will calculate the leading corrections to the classical solutions, for which it is enough to restrict ourselves to the quadratic order. Furthermore, we will consider only the leading order effect in the $N \to \infty$ limit. The integration up to quadratic terms is given by a ratio of functional determinants, which can be evaluated in a standard way. As a result, the effective action is found to be



$$S(\phi, \lambda) + gR \int d^D x \left( \frac{NQ}{l^{2\alpha - D}} \right)^{D/2\alpha}. \qquad (6)$$



Here, the constant $R$ is defined as $R = 1 / \left[ (4\pi)^{D/2}\, \pi^{-1} D\, \Gamma(D/2) \sin(D\pi/2\alpha) \right]$.
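
The value of $R$ can be cross-checked numerically. The text only says that the determinant ratio is evaluated in a standard way, so the sketch below rests on two assumptions. The first is that the local term comes from the one-loop momentum integral $\frac{1}{2} \int \frac{d^D p}{(2\pi)^D} \ln\left(1 + m/(c\, p^{2\alpha})\right) = R\, (m/c)^{D/2\alpha}$, with $m = NQ$ and $c = l^{2\alpha - D}$. The second is that $R = \pi / [(4\pi)^{D/2} D\, \Gamma(D/2) \sin(D\pi/2\alpha)]$. The function names are illustrative only.

```python
import math


def log_integral(s, t_min=-80.0, t_max=160.0, step=0.01):
    """Trapezoid estimate of I(s) = int_0^inf u**(-s-1) * ln(1+u) du for 0 < s < 1.

    The substitution u = exp(t) gives int exp(-s*t) * ln(1 + exp(t)) dt,
    whose integrand decays exponentially on both sides.
    """
    n = int(round((t_max - t_min) / step))
    total = 0.0
    for i in range(n + 1):
        t = t_min + i * step
        # ln(1 + e^t), written so that neither branch overflows.
        if t > 0:
            log_term = t + math.log1p(math.exp(-t))
        else:
            log_term = math.log1p(math.exp(t))
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.exp(-s * t) * log_term
    return total * step


def r_closed_form(dim, alpha):
    """Assumed closed form of R (requires 2*alpha > dim)."""
    s = dim / (2 * alpha)
    return math.pi / ((4 * math.pi) ** (dim / 2) * dim
                      * math.gamma(dim / 2) * math.sin(math.pi * s))


def r_numeric(dim, alpha):
    """R from the assumed one-loop integral with m/c = 1."""
    s = dim / (2 * alpha)
    # Angular part of d^D p / (2*pi)^D: 2 / ((4*pi)^(D/2) * Gamma(D/2)).
    angular = 2.0 / ((4 * math.pi) ** (dim / 2) * math.gamma(dim / 2))
    # Radial part: int p^(D-1) ln(1 + p^(-2*alpha)) dp = I(s) / (2*alpha).
    radial = log_integral(s) / (2 * alpha)
    return 0.5 * angular * radial


if __name__ == "__main__":
    for dim, alpha in [(1, 1), (2, 2), (1, 2), (3, 2)]:
        print(dim, alpha, r_closed_form(dim, alpha), r_numeric(dim, alpha))
```

Under these assumptions the two columns agree, which supports reading the prefactor as $(4\pi)^{D/2} \pi^{-1} D\, \Gamma(D/2) \sin(D\pi/2\alpha)$ in the denominator.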

Now that we have obtained a local correction term, we can determine the unknown function $F(\phi)$. As is easily seen, a counter term can be produced only through the following definition of the function $F(\phi)$:



$$F(\phi) = k_0\, e^{-\frac{D}{2\alpha} \phi}, \qquad (7)$$

$$k_0 = k - gR\, N^{D/2\alpha}. \qquad (8)$$



Here, we have introduced a parameter $k$, which takes its observed value at a scale $N$. Since the bare parameter $k_0$ does not depend on the scale $N$, we obtain the renormalization group equation which describes the scaling behavior of the parameter $k$:



$$\frac{dk}{dN} = g\, \frac{DR}{2\alpha}\, \frac{1}{N^{1 - D/2\alpha}}. \qquad (9)$$



As a result, we find the scaling form of the parameter to be $k \sim gR N^{D/2\alpha}$, which does not depend on the initial value of $k$.
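
The loss of memory of the initial value can be illustrated by integrating the flow equation numerically. This is a minimal sketch: the values $g = R = 1$, $D = \alpha = 1$, the starting point $N = 1$ and the function names are assumptions chosen for illustration, not values taken from the text.

```python
import math


def beta(n, g=1.0, r=1.0, dim=1, alpha=1):
    """Right-hand side dk/dN = g * (D*R / (2*alpha)) * N**(D/(2*alpha) - 1)."""
    return g * dim * r / (2 * alpha) * n ** (dim / (2 * alpha) - 1)


def flow(k_start, n_end, g=1.0, r=1.0, dim=1, alpha=1, steps=20000):
    """Integrate dk/dN from N = 1 to n_end (midpoint rule on a logarithmic grid)."""
    k = k_start
    log_end = math.log(n_end)
    for i in range(steps):
        n_lo = math.exp(log_end * i / steps)
        n_hi = math.exp(log_end * (i + 1) / steps)
        n_mid = math.sqrt(n_lo * n_hi)  # geometric midpoint of the step
        k += beta(n_mid, g, r, dim, alpha) * (n_hi - n_lo)
    return k


def scaling_form(n, g=1.0, r=1.0, dim=1, alpha=1):
    """Asymptotic form k ~ g * R * N**(D/(2*alpha))."""
    return g * r * n ** (dim / (2 * alpha))


if __name__ == "__main__":
    for k_start in (0.1, 1.0, 50.0):
        ratio = flow(k_start, 1e8) / scaling_form(1e8)
        print(k_start, ratio)
```

For each starting value the printed ratio approaches one as the final $N$ grows, since the initial value only contributes a constant offset to $k$.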

The scaling behavior can be understood from the following point of view. In learning problems, the number of example data $N$ plays the role of setting our observation scale. When we have received a small number of data points, we observe a signal source at a low resolution. The more examples we have received, the higher the resolution at which we can investigate the source. The signal source with a true probability distribution is a bare quantity, which exists independently of the length scale of our observation. The renormalization group equation (9) means that we should scale the parameter $k$ in order to respect the invariance of the true probability distribution under the change of the scale $N$. Therefore, such a scaling of the parameter $k$ is expected to lead to the optimal performance.

On the other hand, we have the relation $k_0 = k$ if we do not consider the fluctuation effects in the Bayesian setting. This is just the case of maximum likelihood estimation. Again, the invariance of $k_0$ under the change of $N$ requires that we should not vary the parameter $k$ with the number of examples $N$. It will be seen later that the scaling of the parameter $k$ plays a crucial role in achieving the optimal learning rate.

An on-line learning algorithm specifies how to change our hypothesis about an unknown distribution after receiving a new example. It is given by the recursion relation for the expectation value $\langle \phi(x) \rangle$ of the field $\phi(x)$. At our level of approximation, the expectation value is just the classical solution $\phi(x)$. Therefore, an on-line learning algorithm is directly obtained from the variation of equation (4).

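$$\Delta \langle \phi_N(x) \rangle \simeq -\frac{1}{g} \int d^D x'\, G_N(x, x') \left[ \delta^D(x' - x_{N+1}) - \frac{\Delta(i\lambda)_N}{l^D}\, e^{-\langle \phi_N(x') \rangle} - \frac{D}{2\alpha l^D} \frac{dk_N}{dN}\, e^{-\frac{D}{2\alpha} \langle \phi_N(x') \rangle} \right]. \qquad (10)$$
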
Hereafter, we will explicitly attach the number of examples $N$ to denote the scale. The learning rate $G_N(x, x')$ is a Green's function, and defines a local bin size $\xi_N(x)$.



[Equation (11), the explicit form of $G_N(x, x')$, is missing from the source text.]

$$\xi_N(x) = l \left[ (i\lambda)_N\, e^{-\langle \phi_N(x) \rangle} \right]^{-1/2\alpha}. \qquad (12)$$



Here, $r$ is the distance between the points $x$ and $x'$, and the function $K_{D/2 - 1}$ is a modified Bessel function of the second kind. $\gamma_n$ is defined to be $\gamma_n = e^{i(2n+1)\pi/2\alpha}\, e^{-i\pi/2}$. The Green's function is real and regular for any $r > 0$.

The first term on the right-hand side of (10) gives the effect of a new datum $x_{N+1}$, which is necessarily local. The minus sign of the term ensures the increase of the distribution $\langle Q_N \rangle = e^{-\langle \phi_N \rangle}/l^D$ at the point $x_{N+1}$. The second and third terms show the influence of the changes of the parameters $(i\lambda)_N$ and $k_N$, respectively. The Green's function smooths them over the local length scale $\xi_N(x)$. We can see from equation (12) that the length scale becomes small only where we find many example data and when the number of examples $N$ is large. Thus, the bin size $\xi_N(x)$ penalizes the introduction of too many degrees of freedom when reconstructing a probability distribution from given data.
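
The dependence of the bin size on the data can be made concrete with a small calculation. The sketch assumes that (12) has the form $\xi_N(x) = l\, [(i\lambda)_N e^{-\langle \phi_N(x) \rangle}]^{-1/2\alpha}$ and replaces $(i\lambda)_N$ by its leading behavior $N$; the function name and arguments are hypothetical.

```python
def bin_size(n_examples, density, dim=1, alpha=1, unit_length=1.0):
    """Local bin size under the assumed form xi = l * (N * l**D * Q)**(-1/(2*alpha)).

    Uses (i*lambda)_N ~ N and exp(-phi) = l**D * Q, where Q is the current
    estimate of the density at the point of interest.
    """
    if n_examples <= 0 or density <= 0:
        raise ValueError("n_examples and density must be positive")
    mass = n_examples * unit_length ** dim * density
    return unit_length * mass ** (-1.0 / (2 * alpha))


if __name__ == "__main__":
    # More data, or a denser region, gives a finer local resolution.
    for n in (100, 400, 1600):
        print(n, bin_size(n, density=1.0), bin_size(n, density=4.0))
```

With $D = \alpha = 1$, quadrupling either the number of examples or the local density halves the bin size, in line with the statement that the length scale shrinks only where data are plentiful.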

We note that we have not yet given the explicit form of the change $\Delta(i\lambda)_N$. We find that $\Delta(i\lambda)_N \simeq 1 - D^2 k_N / \left[ 4\alpha^2 N\, e^{-(1 - D/2\alpha) \langle \phi_N(x_{N+1}) \rangle} \right]$, which is obtained from the variation of $(i\lambda)_N = N - (D k_N / 2\alpha) \int d^D x\, (Q_N / l^{2\alpha - D})^{D/2\alpha}$.

In the rest of this letter, we will discuss the average performance of the algorithm. Since we consider the asymptotic situation, we may expand the algorithm around the true probability distribution $Q^* = e^{-\phi^*}/l^D$, which is an ultraviolet fixed point. We then define an error $\epsilon_N(x)$ as the difference $\epsilon_N(x) = \langle \phi_N(x) \rangle - \phi^*(x)$, and evaluate the dynamics of the error.

We have expanded the Green's function in the convolution of equation (10) around the fixed point, and found that it decays as $1/N$, which is known to be optimal in on-line learning by neural networks. Putting $\langle \phi_N \rangle = \phi^* + \epsilon_N$ into the expansion, we obtain the evolution equation of the error $\epsilon_N$. However, it must be solved with care, since its $N$-dependence is not extracted explicitly. The equation depends on a new datum $x_{N+1}$ through the terms $\delta^D(x - x_{N+1})$ and $\Delta(i\lambda)_N$. The data dependence is averaged over many steps of the evolution, as follows from the relations:



$$\sum_{i=1}^{N} \delta^D(x - x_{i+1}) = N Q^*(x) + \sqrt{N}\, \rho(x), \qquad (13)$$

$$\langle\langle \rho(x) \rho(x') \rangle\rangle = Q^*(x)\, \delta^D(x - x'). \qquad (14)$$

Here, the function $\rho(x)$ is a fluctuating density. Equation (13) shows that we have to separate the data-dependent terms into a leading order part and a sub-leading one. Therefore, we adopt the average $\int d^D x_{N+1}\, Q^*(x_{N+1})\, \Delta(i\lambda)_N$ as the leading order of $\Delta(i\lambda)_N$, and $\Delta(i\lambda)_N - \int d^D x_{N+1}\, Q^*(x_{N+1})\, \Delta(i\lambda)_N$ as the sub-leading one. Then, the leading order term $\epsilon^0_N(x)$ of the large-$N$ expansion of the error, $\epsilon_N(x) = \epsilon^0_N(x) + \epsilon^1_N(x) + \cdots$, obeys the equation:



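$$N \epsilon^0_{N+1}(x) - (N - 1)\, \epsilon^0_N(x) = 1 - (Q^*)^{-1} \delta^D(x - x_{N+1}) + g\, \frac{D^2 R}{4\alpha^2 N^{1 - D/2\alpha}} \left[ \frac{1}{(l^D Q^*)^{1 - D/2\alpha}} - \int d^D x\, \frac{Q^*}{(l^D Q^*)^{1 - D/2\alpha}} \right]. \qquad (15)$$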



Summing equation (15) as $N$ runs from 1 to $N$ gives the following result.



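$$\epsilon^0_N(x) \simeq -\frac{1}{\sqrt{N}} \frac{\rho(x)}{Q^*(x)} + g\, \frac{DR}{2\alpha} \frac{1}{N^{1 - D/2\alpha}} \left[ \frac{1}{(l^D Q^*)^{1 - D/2\alpha}} - \int d^D x\, \frac{Q^*}{(l^D Q^*)^{1 - D/2\alpha}} \right]. \qquad (16)$$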
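
Relation (13), which is used when the data-dependent terms are summed, can be illustrated with a small Monte Carlo experiment on a binned density. This is only a sketch: the uniform test distribution on $[0, 1)$, the bin count, the sample sizes and the function name are assumptions made for the example. For a bin of probability $p$, the fluctuation $(n_b - Np)/\sqrt{N}$ of the count $n_b$ has the exact multinomial variance $p(1 - p)$, which reduces to the value $p$ implied by (14) when the bins are small.

```python
import random


def fluctuation_variance(n_samples=500, n_bins=10, n_trials=400, seed=0):
    """Average over bins and trials of ((count - N*p) / sqrt(N))**2.

    Samples are uniform on [0, 1), so every bin has probability p = 1/n_bins.
    """
    rng = random.Random(seed)
    p_bin = 1.0 / n_bins
    sum_sq = 0.0
    for _ in range(n_trials):
        counts = [0] * n_bins
        for _ in range(n_samples):
            index = min(int(rng.random() * n_bins), n_bins - 1)
            counts[index] += 1
        for count in counts:
            rho = (count - n_samples * p_bin) / n_samples ** 0.5
            sum_sq += rho * rho
    return sum_sq / (n_trials * n_bins)


if __name__ == "__main__":
    p = 0.1
    print("measured:", fluctuation_variance())
    print("multinomial p(1-p):", p * (1 - p), " small-bin limit p:", p)
```

The measured variance stays of order one as $N$ grows, which is the content of the $\sqrt{N}$ scaling of the fluctuating term in (13).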



The sub-leading term $\epsilon^1_N(x)$ is found to be $O(\log N / N)$ [$\alpha = D$] or a power of $N$ [$\alpha \neq D$]. However, we can prove that it vanishes if and only if $\alpha = D$. Then, equation (16) is exact up to sub-leading order, and leads to the optimal error decay (17). This means that we can infer distributions most effectively when we choose their candidates among $C^{2D}$-class functions in $D$ dimensions.

From equation (16), we can show that the average of the squared error $\epsilon_N(x)^2$ gives a universal result. This is consistent with the definition of the reparameterization-invariant distance between two distributions $p(x, \xi)$ and $p(x, \xi + d\xi)$, which are specified by parameters $\xi^i$: $ds^2 = g_{ij}\, d\xi^i d\xi^j = E[\partial_i \log p\, \partial_j \log p]\, d\xi^i d\xi^j = E[(d \log p)^2]$.
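
As a concrete instance of this metric, the sketch below evaluates $E[(\partial_\mu \log p)^2]$ numerically for a one-parameter family. The Gaussian family with unknown mean $\mu$ and fixed width $\sigma$ is an assumption chosen for illustration, not an example from the text; for it the metric component is known to be $1/\sigma^2$.

```python
import math


def fisher_information_gaussian_mean(mu=0.0, sigma=1.0, half_width=10.0, steps=20000):
    """Trapezoid estimate of E[(d/dmu log p)^2] for p(x; mu) = Normal(mu, sigma**2)."""
    lower = mu - half_width * sigma
    h = 2 * half_width * sigma / steps
    total = 0.0
    for i in range(steps + 1):
        x = lower + i * h
        density = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        score = (x - mu) / sigma ** 2  # d/dmu of log p(x; mu)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * density * score * score
    return total * h


if __name__ == "__main__":
    for sigma in (0.5, 1.0, 2.0):
        g_mu_mu = fisher_information_gaussian_mean(sigma=sigma)
        print(sigma, g_mu_mu, 1.0 / sigma ** 2)
```

The line element is then $ds^2 = d\mu^2 / \sigma^2$, so a shift of the mean is measured in units of the width of the distribution.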
We have found the quadratic error as

$$\left\langle\!\left\langle \int d^D x\, Q^* \epsilon_N^2 \right\rangle\!\right\rangle = \text{[right-hand side missing from the source text]} \qquad (17)$$

Here, the coefficients $V_p$ and $V_x$ denote the volumes of momentum space and of coordinate space, respectively. The constant $V_p V_x$ gives the number of all possible configurations of the system. Since it is independent of how the system is described, we have reproduced the universal asymptotic behavior, which is well known in neural network models. Thus, we have obtained the $1/N$ decay of the quadratic error up to sub-leading order, which is considered to be optimal, as in the case of neural networks. We note that the optimal behavior (17) can be obtained if and only if we evolve the parameter $k_N$ according to the renormalization group equation (9).

Figure 1 shows the result of a numerical simulation of the algorithm (10), applied to the learning of a two-dimensional Gaussian distribution. The quadratic error is clearly asymptotically linear in the inverse of the number of examples, $1/N$. This confirms the analytical result (17).

Finally, we point out the relation to previous work and make concluding remarks. Bialek et al. gave an excellent field theoretical formulation of the batch learning of probability distributions. They controlled the number of degrees of freedom through the scaling of an infrared cut-off parameter $l$. In particular, they obtained the optimal scaling of a bin size $\xi_N \propto N^{-1/(2\alpha + D)}$, which includes the well-known result $\xi_N \propto N^{-1/3}$ in one-dimensional space, $D = \alpha = 1$. On the other hand, we have analyzed the on-line learning of probability distributions. In our case, the renormalization group enables us to control the bin size (12) and leads to the optimal change of the hypothesis (10) through the scaling of the parameter $k_N$. From $(i\lambda)_N \sim N$ and $\alpha = D$, we find the optimal scaling of the bin size $\xi_N \propto N^{-1/2D}$ in the on-line learning scheme.

Thus, the renormalization group gives an on-line learning algorithm in a natural way, which we have proved to be optimal. Our discussion is so general that we may adopt it as a principle for deriving optimal on-line learning algorithms. Furthermore, we expect that the optimal error decay can also be achieved up to any desired order, if we extend the field theoretical analysis to that order.

FIG. 1. The quadratic error (17) in learning the two-dimensional Gaussian distribution $Q^*(x) = e^{-(x^2 + y^2)/2}/2\pi$ [$-1/2 < x, y < 1/2$]. We have performed the algorithm (10) with the area divided into $40 \times 40$ pieces.